Add Support for 2/3/8-bit GPTQ Quantization Models #2330
Conversation
Hey @chu-tianxiang, what's the request rate / QPS for your throughput test? Any intuition on why we've seen ~2x tokens per second but lower throughput?
I used benchmark_throughput.py, which adds all requests before running the inference instead of sending them at some request rate.
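For reference, the offline-batch style that script uses can be approximated with vLLM's Python API roughly as below; the model name, prompt count, and sampling settings here are placeholders, not the benchmark's actual configuration.

```python
from vllm import LLM, SamplingParams

# Placeholder GPTQ checkpoint; any model supported by this PR would do.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ", quantization="gptq")

# All requests are handed to the engine in one batch, so there is no
# per-request arrival rate: throughput is total generated tokens
# divided by the wall-clock time of this single generate() call.
prompts = [f"Prompt {i}: write a short story." for i in range(256)]
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(prompts, params)
```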
Hey @chu-tianxiang, can you please update this PR to the latest master branch?
Looking forward to this getting merged!
@chu-tianxiang I have tested this feature using the model https://huggingface.co/TheBloke/WizardCoder-33B-V1.1-GPTQ/tree/gptq-8bit--1g-actorder_True. It's OK when setting …
LGTM! Awesome! Thanks for the PR and apologies for the delayed review.
GPTQ 8-bit doesn't work on V100; it cannot compile since it requires sm80 and above (the Marlin and QuIP kernels). Any plan to fix that, given that V100 is on the official supported list?
@esmeetu I tested the model in the link and cannot reproduce the illegal memory access error. Could you please provide more details about the setup and code? @aliencaocao this PR doesn't include Marlin or QuIP kernels; I guess you're talking about the gptq hf branch?
Yes, I meant the gptq hf branch. I figured it out myself by removing all the QuIP and Marlin code, and it works for me.
There's already a pull request supporting varying quantization bit levels for GPTQ models, leveraging kernels from the AutoGPTQ repository. This PR presents an alternative approach inspired by exllamav2.
While exllamav2 doesn't natively support 2/3/8-bit GPTQ models, it has the essential components. In essence, EXL2 operates as a mixed-bit GPTQ format, so uniform 2/3/8-bit models can be seen as special cases. Although there are minor differences in scales and zero points, these can be easily adjusted.
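To make the "special case" point concrete, here is a rough PyTorch-level sketch of dequantizing a uniform-bit GPTQ layer. It is not code from this PR (the real work lives in the kernels), and the tensor names and the -1 zero-point packing offset are assumptions based on the common AutoGPTQ checkpoint layout. The per-group scale/zero formula is identical for every bit width, which is why uniform-bit layers can slot into a mixed-bit (EXL2-style) kernel as special cases.

```python
import torch

def dequantize_gptq(qweight, qzeros, scales, g_idx, bits):
    """Sketch for bit widths that divide 32 (2, 4, 8); 3-bit values span
    int32 boundaries and need extra unpacking logic that is omitted here.

    qweight: [in_features * bits // 32, out_features] int32, packed along input dim
    qzeros:  [num_groups, out_features * bits // 32]  int32, packed along output dim
    scales:  [num_groups, out_features]               float16
    g_idx:   [in_features] group index of each input channel
    """
    pack = 32 // bits
    mask = (1 << bits) - 1
    shifts = torch.arange(pack, dtype=torch.int32) * bits

    # Unpack weights along the input dimension.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & mask   # [rows, pack, out]
    w = w.reshape(-1, qweight.shape[1])                          # [in_features, out]

    # Unpack zero points along the output dimension; AutoGPTQ-style checkpoints
    # store zeros minus one, hence the +1 (assumed convention).
    z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & mask    # [groups, out/pack, pack]
    z = z.reshape(qzeros.shape[0], -1) + 1                       # [groups, out]

    # Same per-group formula regardless of bit width.
    g = g_idx.long()
    return (w.float() - z[g].float()) * scales[g].float()
```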
Following is a comparison of the latency and throughput of Llama2-7B under different quantization bit widths on a single A100. The 4-bit numbers are from the main branch with the CUDA-graph fix, while the 3-bit and 8-bit numbers are newly added. All were measured using the benchmark_latency.py and benchmark_throughput.py scripts. (2-bit GPTQ models can hardly generate coherent output and are of no practical value, so I didn't include them below.) This has not been tested on ROCm devices yet.
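As a usage note, not taken from the PR itself: loading one of the newly supported bit widths through vLLM's Python API should look the same as the 4-bit case, since the bit width comes from the checkpoint's quantization config rather than a flag. The model and revision below are simply the 8-bit branch mentioned earlier in this thread.

```python
from vllm import LLM, SamplingParams

# 8-bit GPTQ branch referenced above; the bit width is read from the
# checkpoint's quantize_config.json, not passed explicitly.
llm = LLM(
    model="TheBloke/WizardCoder-33B-V1.1-GPTQ",
    revision="gptq-8bit--1g-actorder_True",
    quantization="gptq",
)

out = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```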